Early Web Datasets

Various datasets created for research, scholarly, or general use, generated from the "early web" era (1996-1999) of the Internet Archive's global web archive collection.


This dataset consists of metadata records identifying the language of webpages from the early web (1996-1999) portion of the Internet Archive’s web archive collection. The metadata includes capture information for each URL along with the top three language annotations for each page, as detected by Google’s Compact Language Detector (CLD3), a neural network model for language identification. Confidence scores are computed according to Internet Archive’s own Square-Leaf-Model,...

This dataset consists of multi-language URLs of the early web (1996-1999), grouped by common URL patterns. “Parallel language” refers to the same text represented in different languages; multi-language text from websites is a rich source for parallel language corpora and can be valuable in work such as machine translation. The URLs and patterns were derived and extracted from the Internet Archive’s Wayback Machine web archive, for all archived webpages successfully captured before year...